ITU Validation Set for Metu-Sabancı Turkish Treebank
Abstract
The Turkish Treebank (Oflazer et al., 2003; Atalay et al., 2003), created by the Middle East Technical University and Sabancı University, has been available to researchers since 2003 and has been used in many studies since then (Eryiğit and Oflazer, 2006; Eryiğit et al., 2006b; Eryiğit et al., 2006a; Nivre et al., 2007; Çakıcı and Baldridge, 2006; Buchholz and Marsi, 2006; Yüret, 2006; Wu et al., 2006; Dreyer et al., 2006; Shimizu, 2006; Schiehlen and Spranger, 2006; Riedel et al., 2006; Johansson and Nugues, 2006; McDonald et al., 2006; Liu et al., 2006; Chang et al., 2006; Corston-Oliver and Aue, 2006; Cheng et al., 2006; Carreras et al., 2006; Canisius et al., 2006; Bick, 2006; Attardi, 2006; Eryiğit, 2006). Although it contains some inconsistencies and continues to be updated with newer versions, it has served research on dependency parsing of Turkish well in recent years. The Turkish treebank consists of 5635 sentences annotated with dependency structures. The modest size of the treebank has been mentioned in many studies (Nivre et al., 2007; Buchholz and Marsi, 2006). Needless to say, the size should be increased to support better research in the field, but we should also note that the small word count (48K) of this treebank can in fact be attributed to a feature of the language itself. In the treebank, the average number of words in a sentence is 8.6, which is much lower than in other languages. This is because in Turkish a single word is sometimes equivalent to a whole sentence in another language, a consequence of the language's agglutinative structure. This property of the language makes the treebank look smaller than it is when compared...
Similar resources
Morpheme Segmentation in the METU-Sabancı Turkish Treebank
Morphological segmentation data for the METU-Sabancı Turkish Treebank is provided in this paper. The generalized lexical forms of the morphemes, which the treebank previously lacked, are added to the treebank. This data may be used to train POS-taggers that use stemmer outputs to map these lexical forms to morphological tags.
Revising the METU-Sabancı Turkish Treebank: An Exercise in Surface-Syntactic Annotation of Agglutinative Languages
In this paper, we present a revision of the training set of the METU-Sabancı Turkish syntactic dependency treebank composed of 4997 sentences in accordance with the principles of the Meaning-Text Theory (MTT). MTT reflects the multilayered nature of language by a linguistic model in which each linguistic phenomenon is treated at its corresponding level(s). Our analysis of the METU-Sabancı synta...
Transition-based Dependency DAG Parsing Using Dynamic Oracles
In most dependency parsing studies, dependency relations within a sentence are presented as a tree structure. Whilst the tree structure is sufficient to represent the surface relations, deep dependencies, which may result in multi-headed relations, require more general dependency structures, namely Directed Acyclic Graphs (DAGs). This study proposes a new dependency DAG parsing appro...
Use of Lexical Statistics for Compound Word Recognition and Segmentation in Turkish
Compound words are cross-linguistic morphological phenomena that occur in all languages. Compound words are widely accepted to be stored in the lexicon but their constituents need to be accessed during both language learning and production processes. In this study, the use of corpora was investigated for how to differentiate single-stem words from single-word compounds and then how to segment c...
Marmara Turkish Coreference Corpus and Coreference Resolution Baseline
We describe the Marmara Turkish Coreference Corpus, which is an annotation of the whole METU-Sabanci Turkish Treebank with mentions and coreference chains. Collecting nine or more independent annotations for each document allowed for fully automatic adjudication. We provide a baseline system for Turkish mention detection and coreference resolution and evaluate it on the corpus.
Publication date: 2007